Abstract. This paper describes an experimental comparison between
two standard supervised learning methods, namely Naive Bayes and
Exemplar-based classification, on the Word Sense Disambiguation
(WSD) problem. The aim of the work is twofold. Firstly, it attempts
to clarify some confusing information about the comparison between
both methods appearing in the related literature. In doing so,
several directions have been explored, including testing several
modifications of the basic learning algorithms and varying the
feature space. Secondly, an improvement of both algorithms is
proposed in order to deal with large attribute sets. This
modification, which basically consists of using only the positive
information appearing in the examples, greatly improves the
efficiency of the methods with no loss in accuracy. The experiments
have been performed on the largest sense-tagged corpus available,
containing the most frequent and ambiguous English words. Results
show that the Exemplar-based approach to WSD is generally superior
to the Bayesian approach, especially when a specific metric for
dealing with symbolic attributes is used.

1 INTRODUCTION 

Word Sense Disambiguation (WSD) is the problem of assigning the 
appropriate meaning (sense) to a given word in a text or discourse. 
Resolving the ambiguity of words is a central problem for language 
understanding applications and their associated tasks including, 
for instance, machine translation, information retrieval and hypertext 
navigation, parsing, speech synthesis, spelling correction, reference 
resolution, automatic text summarization, etc. 

WSD is one of the most important open problems in the Natural
Language Processing (NLP) field. Despite the wide range of
approaches investigated and the large effort devoted to tackling
this problem, to date no large-scale, broad-coverage and highly
accurate word sense disambiguation system has been built.

One of the most successful current lines of research is the
corpus-based approach, in which statistical or Machine Learning (ML)
algorithms have been applied to learn statistical models or
classifiers from corpora in order to perform WSD. Generally,
supervised approaches (those that learn from a previously
semantically annotated corpus) have obtained better results than
unsupervised methods on small sets of selected highly ambiguous
words, or artificial pseudo-words. Many standard ML algorithms for
supervised learning have been applied, such as: Bayesian learning
[?], Exemplar-based learning [18, 16, ?], Decision Lists [?], Neural
Networks, etc. Further, Mooney [?] provides a comparative experiment
on a very restricted domain between all previously cited methods but
also including Decision Trees and Rule Induction algorithms.

Despite the good results obtained on limited domains, supervised
methods suffer from the lack of widely available semantically tagged
corpora from which to construct really broad-coverage systems. This
is known as the "knowledge acquisition bottleneck" [?]. Ng [17]
estimates that the manual annotation effort necessary to build a
broad-coverage semantically annotated corpus would be about 16
man-years. This extremely high overhead for supervision, together
with the also serious learning overhead when common ML algorithms
scale to real-size WSD problems, explains why supervised methods
have been seriously questioned.

Due to this fact, recent works have focused on reducing the
acquisition cost as well as the need for supervision of corpus-based
methods for WSD. Consequently, the following three lines of research
are currently being studied: 1) The design of efficient example
sampling methods [?]; 2) The use of lexical resources, such as
WordNet [13], and WWW search engines to automatically obtain
accurate and arbitrarily large word sense samples from the Internet
[?]; 3) The use of unsupervised EM-like algorithms for estimating
the statistical model parameters [19]. It is our belief that this
body of work, and in particular the second line, provides enough
evidence towards the "opening" of the acquisition bottleneck in the
near future. For that reason, it is worth further investigating the
application of supervised ML methods to WSD, and thoroughly
comparing existing alternatives.



1.1 Comments about Related Work 

Unfortunately, there have been very few direct comparisons between
alternative methods for WSD. However, it is commonly stated that
Naive Bayes, Neural Networks and Exemplar-based learning represent
state-of-the-art accuracy on supervised WSD [?]. Regarding the
comparison between Naive Bayes and Exemplar-based methods, the works
by Mooney [?] and Ng [16] are the ones mainly referred to in this
paper.

Mooney's paper shows that the Bayesian approach is clearly superior
to the Exemplar-based approach. Although it is not explicitly said,
the overall accuracy of Naive Bayes is about 16 points higher than
that of the Exemplar-based algorithm, and the latter is only
slightly above the accuracy that a Most-Frequent-Sense classifier
would obtain. In the Exemplar-based approach, the algorithm applied
for classifying new examples was a standard k-Nearest-Neighbour
(k-NN), using the Hamming distance for measuring closeness. Neither
example weighting nor attribute weighting is applied, k is set to 3,
and the number of attributes used is said to be almost 3,000.
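
As an illustration of the kind of classifier discussed here, the following is a minimal sketch of a plain k-NN with Hamming distance and unweighted majority voting. It is not the code used in any of the cited works: the function names and the toy data layout are hypothetical, and the tie-breaking rule (most frequent sense among those tied) is borrowed from the basic implementation described in Section 2.2.

```python
from collections import Counter

def hamming(a, b):
    """Number of attribute positions in which two examples differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(train, query, k=3):
    """train: list of (attribute_tuple, sense) pairs; query: attribute tuple."""
    # Overall sense frequencies, used only to break voting ties.
    prior = Counter(sense for _, sense in train)
    # Every stored example is examined; keep the k closest ones.
    nearest = sorted(train, key=lambda ex: hamming(ex[0], query))[:k]
    votes = Counter(sense for _, sense in nearest)
    top = max(votes.values())
    tied = [s for s, v in votes.items() if v == top]
    # Ties go to the sense that is most frequent in the training set.
    return max(tied, key=lambda s: prior[s])
```

With no example or attribute weighting, every neighbour and every attribute counts equally, which is the setting the experiments in Section 4 start from.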

The second paper compares the Naive Bayes approach with PEBLS [?], a
more sophisticated Exemplar-based learner especially designed for
dealing with examples that have symbolic features. This paper shows
that, for a large number of nearest neighbours, the performance of
both algorithms is comparable, while if cross-validation is used for
parameter setting, PEBLS slightly outperforms Naive Bayes. It has to
be noted that the comparison was carried out in a limited setting,
using only 7 features, and that the attribute/example-weighting
facilities provided by PEBLS were not used. The author suggests that
the poor results obtained in Mooney's work were due to the metric
associated with the k-NN algorithm, but he did not test whether the
MVDM metric used in PEBLS is superior to the standard Hamming
distance or not.

Another surprising result that appears in Ng's paper is that the
accuracy results obtained were 1-1.6% higher than those reported by
the same author one year before [?], when running exactly the same
algorithm on the same data, but using a larger and richer set of
attributes. This apparently paradoxical difference is attributed, by
the author, to the feature pruning process performed in the older
paper.

Apart from the contradictory results obtained by the previous pa- 
pers, some methodological drawbacks of both comparisons should 
also be pointed out. On the one hand, Ng applies the algorithms on 
a broad-coverage corpus but reports the accuracy results of a sin- 
gle testing experiment, providing no statistical tests of significance. 
On the other hand, Mooney performs thorough and rigorous
experiments, but he compares the alternative methods on a limited
domain consisting of a single word with a reduced set of six senses.
Thus, it is our claim that this extremely specific domain does not
guarantee reliable conclusions about the relative performance of
alternative methods when applied to broad-coverage domains.

Consequently, the aim of this paper is twofold: 1) To study the 
source of the differences between both approaches in order to clarify 
the contradictory and incomplete information. 2) To empirically test 
the alternative algorithms and their extensions on a broad-coverage 
sense tagged corpus, in order to estimate which is the most appropri- 
ate choice. 

The paper is organized as follows: Section 2 describes the
algorithms that will be tested, as well as the notation used.
Section 3 explains the experimental setting in detail. Section 4
reports the set of experiments performed and the analysis of the
results obtained. The best alternative methods are tested on a
broad-coverage corpus in Section 5. Finally, Section 6 concludes and
outlines some directions for future work.

2 BASIC METHODS
2.1 Naive Bayes

The Naive Bayes classifier has been used in its most classical
setting [?]. Let $C_1 \ldots C_m$ be the different classes and
$\cap v_j$ the set of feature values of a test example. The Naive
Bayes method tries to find the class that maximizes
$P(C_i \mid \cap v_j)$. Assuming independence between features, the
goal of the algorithm can be stated as:

$$\arg\max_i P(C_i \mid \cap v_j) \approx \arg\max_i P(C_i) \prod_j P(v_j \mid C_i)$$

where $P(C_i)$ and $P(v_j \mid C_i)$ are estimated during the
training process using relative frequencies. To avoid the effects of
zero counts when estimating the conditional probabilities of the
model, a very simple smoothing technique, proposed in Ng's paper
[?], has been used. It consists of replacing zero counts of
$P(v_j \mid C_i)$ with $P(C_i)/N$, where $N$ is the number of
training examples.

Hereinafter, this method will be referred to as NB.

2.2 Exemplar-Based Approach

In our basic implementation all examples are stored in memory and
the classification of a new example is based on a k-NN algorithm,
which uses Hamming distance to measure closeness (in doing so, all
examples are examined). If k is greater than 1, the resulting sense
is the majority sense of the k nearest neighbours. Ties are resolved
in favour of the most frequent sense among all those tied.
Hereinafter, this algorithm will be referred to as EBh,k.

In order to test some of the hypotheses about the differences
between Naive Bayes and Exemplar-based approaches, some variants of
the basic k-NN algorithm have been implemented:

• Example weighting. This variant introduces a simple modification
  in the voting scheme of the k nearest neighbours, which makes the
  contribution of each example proportional to its importance. When
  classifying a new test example, each example of the set of nearest
  neighbours votes for its class with a weight proportional to its
  closeness to the test example. Hereinafter, this variant will be
  referred to as EBh,k,e.

• Attribute weighting. This variant consists of ranking all
  attributes by relevance and making them contribute to the distance
  calculation with a weight proportional to their importance. The
  attribute weighting has been done using the RLM distance measure
  [?]. This measure, belonging to the distance/information-based
  families of attribute selection functions, has been selected
  because it showed better performance than seven other alternatives
  in an experiment of decision tree induction for PoS tagging [?].
  Hereinafter, this variant will be referred to as EBh,k,a.

When both modifications are put together, the resulting algorithm
will be referred to as EBh,k,e,a. Finally, we have also investigated
the effect of using an alternative metric.

• Modified Value Difference Metric (MVDM), proposed by Cost and
  Salzberg [?], allows making graded guesses of the match between
  two different symbolic values. Let $v_1$ and $v_2$ be two values
  of a given attribute $a$. The MVDM distance between them is
  defined as:

  $$d(v_1, v_2) = \sum_{i=1}^{m} \left| \frac{N_{1,i}}{N_1} - \frac{N_{2,i}}{N_2} \right|$$

  where $m$ is the number of classes, $N_{x,i}$ is the number of
  training examples with value $v_x$ of attribute $a$ that are
  classified as class $i$ in the training corpus, and $N_x$ is the
  number of training examples with value $v_x$ of attribute $a$ in
  any class. Hereinafter, this variant will be referred to as
  EBcs,k. This algorithm has also been used with the
  example-weighting facility (EBcs,k,e).
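
The MVDM distance of Section 2.2 can be sketched for a single attribute as follows. This is only an illustrative sketch, not the PEBLS implementation: the function names are hypothetical, and the treatment of values never seen in training (assigned the maximum possible distance, 2.0) is an assumption that the text does not specify.

```python
from collections import Counter

def mvdm_table(values, senses):
    """For one attribute, count how often each value occurs with each sense."""
    table = {}
    for v, s in zip(values, senses):
        table.setdefault(v, Counter())[s] += 1
    return table

def mvdm(table, v1, v2, classes):
    """Sum over classes of |N1i/N1 - N2i/N2| for two values of one attribute."""
    if v1 == v2:
        return 0.0
    c1, c2 = table.get(v1, Counter()), table.get(v2, Counter())
    n1, n2 = sum(c1.values()), sum(c2.values())
    if n1 == 0 or n2 == 0:
        return 2.0  # assumption: unseen values are treated as maximally distant
    return sum(abs(c1[c] / n1 - c2[c] / n2) for c in classes)
```

Two values that occur with the same distribution of senses get distance 0 even if they are different symbols, which is what allows the graded matching that Hamming distance cannot express.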



3 SETTING

In our experiments, both approaches have been evaluated on the DSO
corpus, a semantically annotated corpus containing 192,800
occurrences of 121 nouns and 70 verbs^1, corresponding to the most
frequent and ambiguous English words. This corpus was collected by
Ng and colleagues [?] and it is available from the Linguistic Data
Consortium (LDC)^2.

^1 These examples, consisting of the full sentence in which the
ambiguous word appears, are tagged with a set of labels
corresponding, with minor changes, to the senses of WordNet 1.5 [13].

^2 LDC address: http://www.ldc.upenn.edu/



For our first experiments, a group of 15 words (10 nouns and 5
verbs) which frequently appear in the WSD literature has been
selected. These words are described in the left-hand side of
Table 1. Since our goal is to acquire a classifier for each word,
each row represents a classification problem. The number of classes
(senses) ranges from 4 to 30 and the number of training examples
ranges from 373 to 1,500. The MFS column of Table 1 shows the
percentage of the most frequent sense for each word, i.e. the
accuracy that a naive "Most-Frequent-Sense" classifier would obtain.



Table 1. Set of 15 reference words.

                                       Attributes
Word        POS  Sens.     Exs.   MFS  SetA     SetB
age         n        4      493  62.1     7    3,015
art         n        5      405  46.7     7    2,641
car         n        5    1,381  95.1     7    4,719
child       n        4    1,068  80.9     7    4,840
church      n        4      373  63.1     7    2,375
cost        n        3    1,500  87.3     7    4,930
fall        v       19    1,500  70.1     7    4,173
head        n       14      870  36.9     7    4,284
interest    n        7    1,500  45.1     7    5,328
know        v        8    1,500  34.9     7    5,301
line        n       26    1,342  21.9     7    5,813
set         v       19    1,311  36.9     7    5,749
speak       v        5      517  69.1     7    2,975
take        v       30    1,500  35.6     7    6,428
work        n        7    1,469  31.7     7    6,321
Avg. nouns         8.6  1,040.1  57.4     7  4,935.0
     verbs        17.9  1,265.6  46.6     7  5,203.5
     all          12.1  1,115.3  53.3     7  5,036.6



Two sets of attributes have been used, which will be referred to as
SetA and SetB, respectively. Let
"$\ldots w_{-3}\; w_{-2}\; w_{-1}\; w\; w_{+1}\; w_{+2}\; w_{+3} \ldots$"
be the context of consecutive words around the word $w$ to be
disambiguated. Attributes refer to this context as follows.

• SetA contains the seven following attributes: $w_{-2}$, $w_{-1}$,
  $w_{+1}$, $w_{+2}$, $(w_{-2}, w_{-1})$, $(w_{-1}, w_{+1})$, and
  $(w_{+1}, w_{+2})$, where the last three correspond to
  collocations of two consecutive words. These attributes, which are
  exactly those used in [16], represent the local context of the
  ambiguous word and they have been proven to be very informative
  features for WSD. Note that whenever an attribute refers to a
  position that falls beyond the boundaries of the sentence for a
  certain example, a default value "_" is assigned.

Let $p_{\pm i}$ be the part-of-speech tag of word $w_{\pm i}$, and
$c_1, \ldots, c_m$ the unordered set of open class words appearing
in the sentence.

• SetB enriches the local context: $w_{-1}$, $w_{+1}$,
  $(w_{-2}, w_{-1})$, $(w_{-1}, w_{+1})$, $(w_{+1}, w_{+2})$,
  $(w_{-3}, w_{-2}, w_{-1})$, $(w_{-2}, w_{-1}, w_{+1})$,
  $(w_{-1}, w_{+1}, w_{+2})$ and $(w_{+1}, w_{+2}, w_{+3})$, with
  the part-of-speech information: $p_{-3}$, $p_{-2}$, $p_{-1}$,
  $p_{+1}$, $p_{+2}$, $p_{+3}$, and, additionally, it incorporates
  broad context information: $c_1 \ldots c_m$. SetB is intended to
  represent a more realistic set of attributes for WSD^3. Note that
  the $c_i$ attributes are binary-valued, denoting the presence or
  absence of a content word in the sentence context.
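
As an illustration of the local-context attributes, the seven SetA attributes could be extracted from a tokenized sentence roughly as follows. The function name, the dictionary keys and the handling of collocations that only partly fall outside the sentence (each missing position is padded with "_") are assumptions made for this sketch, not details given in the text.

```python
def seta_attributes(tokens, i, pad="_"):
    """Return the seven SetA attributes for the word at index i of a sentence."""
    def w(offset):
        j = i + offset
        # Positions beyond the sentence boundaries get the default value.
        return tokens[j] if 0 <= j < len(tokens) else pad

    return {
        "w-2": w(-2),
        "w-1": w(-1),
        "w+1": w(+1),
        "w+2": w(+2),
        "(w-2,w-1)": (w(-2), w(-1)),
        "(w-1,w+1)": (w(-1), w(+1)),
        "(w+1,w+2)": (w(+1), w(+2)),
    }
```

For example, calling `seta_attributes(["the", "interest", "rate", "rose"], 1)` describes the target word "interest"; the position two words to its left is outside the sentence and is therefore filled with "_".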

The right-hand side of Table 1 contains the information about the
number of features. Note that SetA has a constant number of
attributes (7), while for SetB this number depends on the concrete
word, and that it ranges from 2,641 to 6,428.

^3 In fact, it incorporates all the attributes used in [?], with the
exception of the morphology of the target word and the verb-object
syntactic relation.



4 EXPERIMENTS 

The comparison of algorithms has been performed in a series of
controlled experiments using exactly the same training and test sets
for each method. The experimental methodology consisted of a 10-fold
cross-validation. All accuracy/error rate figures appearing in the
paper are averaged over the results of the 10 folds. The statistical
tests of significance have been performed using a 10-fold
cross-validation paired Student's t-test [?] with a confidence value
of $t_{9,0.975} = 2.262$.
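
The significance check can be sketched as below, assuming the ordinary paired Student's t statistic computed over the ten per-fold accuracy differences and compared against the threshold 2.262 quoted above. The function names are hypothetical, and the cited test may differ in details not visible in the text.

```python
import math
from statistics import mean, stdev

T_CRITICAL = 2.262  # t_{9,0.975}, the value quoted in the text for 10 folds

def paired_t(acc_a, acc_b):
    """Paired t statistic over the per-fold accuracies of two methods."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    # stdev is the sample standard deviation (n - 1 in the denominator);
    # it is zero, and the statistic undefined, if all differences are equal.
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def significantly_different(acc_a, acc_b):
    """True when the absolute t statistic exceeds the critical value."""
    return abs(paired_t(acc_a, acc_b)) > T_CRITICAL
```

Because both methods are run on exactly the same folds, the test looks at the fold-by-fold differences rather than at the two averages, so a small but consistent advantage can still be significant.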

Exemplar-based algorithms are run several times using different
numbers of nearest neighbours (1, 3, 5, 7, 10, 15, 20 and 25) and
the results corresponding to the best choice are reported^4.

4.1 Using SetA 

Table 2 shows the results of all methods and variants tested on the
15 reference words, using the SetA set of attributes: Most Frequent
Sense (MFS), Naive Bayes (NB), Exemplar-based using Hamming distance
(EBh variants, 5th to 9th columns), and Exemplar-based approach
using the MVDM metric (EBcs variants, 10th to 12th columns) are
included. The best result for each word is printed in boldface. From
these figures, several conclusions can be drawn:

• All methods significantly outperform the MFS classifier.

• Referring to the EBh variants, EBh,7 performs significantly better
  than EBh,1, confirming the results of Ng [16] that values of k
  greater than one are needed in order to achieve good performance
  with the k-NN approach. Additionally, both example weighting
  (EBh,15,e) and attribute weighting (EBh,7,a) significantly improve
  EBh,7. Further, the combination of both (EBh,7,e,a) achieves an
  additional improvement.

• The MVDM metric is superior to Hamming distance. The accuracy of
  EBcs,10,e is significantly higher than that of any EBh variant.
  Unfortunately, the use of weighted examples does not lead to
  further improvement in this case. A drawback of using the MVDM
  metric is the computational overhead introduced by its
  calculation. Table 4 shows that EBh is fifty times faster than
  EBcs using SetA^5.

• The Exemplar-based approach achieves better results than the Naive
  Bayes algorithm. This difference is statistically significant when
  comparing EBcs,10 and EBcs,10,e against NB.

4.2 Using SetB 

The aim of the experiments with SetB is to test both methods with a
realistic large set of features. Table 3 summarizes the results of
these experiments^6.

Let us now consider only NB and EBh (3rd and 5th columns). A very
surprising result is observed: while NB achieves almost the same
accuracy as in the previous experiment, the exemplar-based approach
shows a very low performance. The accuracy of EBh drops 8.6 points
(from the 6th column of Table 2 to the 5th column of Table 3) and is
only slightly higher than that of MFS.

^4 In order to construct a real k-NN-based system for WSD, the k
parameter should be estimated by cross-validation using only the
training set [16]; however, in our case, this cross-validation
inside the cross-validation involved in the testing process would
generate a prohibitive overhead.

^5 The current programs are implemented using PERL-5.003 and they
run on a Sun UltraSPARC-2 machine with 192Mb of RAM.

^6 Detailed results for each word are not included.



Table 2. Results of all algorithms on the set of 15 reference words using SetA.

Word        POS   MFS    NB  EBh,1  EBh,7  EBh,15,e  EBh,7,a  EBh,7,e,a  EBcs,1  EBcs,10  EBcs,10,e
age         n    62.1  73.8   71.4   69.4      71.0     74.4       75.?    70.8     73.6       73.6
art         n    46.7  54.8   44.2   59.3      58.3     58.5       57.0    54.1     59.5       61.0
car         n    95.1  95.4   91.3   95.5      95.8     96.3       96.2    95.4     96.8       96.8
child       n    80.9  86.8   82.3   89.3      89.5     91.0       91.2    87.5     91.0       90.9
church      n    61.1  62.7   61.9   62.7      63.0     62.5       64.1    61.7     64.6       64.3
cost        n    87.3  86.7   81.1   87.9      87.7     88.1       87.8    82.5     85.4       84.7
fall        v    70.1  76.5   73.3   78.2      79.0     78.1       79.8    78.7     81.6       81.9
head        n    36.9  76.9   70.0   76.5      76.9     77.0       78.7    74.3     78.6       79.1
interest    n    45.1  64.5   58.3   62.4      63.3     64.8       66.1    65.1     67.3       67.4
know        v    34.9  47.3   42.2   44.3      46.7     44.9       46.8    45.1     49.7       50.1
line        n    21.9  51.9   46.1   47.1      49.7     50.7       51.9    53.3     57.0       56.9
set         v    36.9  55.8   43.9   53.0      54.8     52.3       54.3    49.7     56.2       56.0
speak       v    69.1  74.3   64.6   72.2      73.7     71.8       72.9    67.1     72.5       72.9
take        v    35.6  44.8   39.3   43.7      46.1     44.5       46.0    45.3     48.8       49.1
work        n    31.7  51.9   42.5   43.7      47.2     48.5       48.9    48.5     52.0       52.5
Avg. nouns       57.4  71.7   65.8   70.0      71.1     72.1       72.6    70.6     73.6       73.7
     verbs       46.6  57.6   51.1   56.3      58.1     56.4       58.1    55.9     60.3       60.5
     all         53.3  66.4   60.2   64.8      66.2     66.1       67.2    65.0     68.6       68.7


Table 3. Results of all algorithms on the set of 15 reference words using SetB.

POS     MFS    NB   PNB  EBh,15  PEBh,1  PEBh,7  PEBh,7,e  PEBh,7,a  PEBh,10,e,a  PEBcs,1  PEBcs,10  PEBcs,10,e
nouns  57.4  72.2  72.4    64.3    70.6    72.4      73.7      72.5         73.4     73.2      75.4        75.6
verbs  46.6  55.2  55.3    43.0    54.7    57.7      59.5      58.9         60.2     58.6      61.9        62.1
all    53.3  65.8  66.0    56.2    64.6    66.8      68.4      67.4         68.4     67.7      70.3        70.5


The problem is that the binary representation of the broad-context
attributes is not appropriate for the k-NN algorithm. Such a
representation leads to an extremely sparse vector representation of
the examples, since in each example only a few words, among all
possible, are observed. Thus, the examples are represented by a
vector of about 5,000 0's and only a few 1's. In this situation two
examples will coincide in the majority of the values of the
attributes (roughly speaking in "all" the zeros) and will probably
differ in those positions corresponding to 1's. This fact wrongly
biases the similarity measure (and thus the classification) in
favour of those stored examples that have fewer 1's, that is, those
corresponding to the shortest sentences.
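
The bias towards short sentences can be made concrete with a small worked example. For two binary vectors, the Hamming distance equals the number of words present in exactly one of the two examples, i.e. |A| + |B| - 2|A ∩ B|. The vocabulary and the three examples below are invented purely for illustration.

```python
def hamming_binary(words_a, words_b, vocabulary):
    """Hamming distance between the binary presence vectors of two word sets."""
    return sum((w in words_a) != (w in words_b) for w in vocabulary)

# Hypothetical vocabulary of 5,000 content words (placeholder names).
vocabulary = [f"w{i}" for i in range(5000)]

test_example = set(vocabulary[:10])        # 10 content words
long_example = set(vocabulary[5:25])       # 20 words, 5 shared with the test example
short_example = set(vocabulary[100:103])   # 3 words, none shared

d_long = hamming_binary(test_example, long_example, vocabulary)    # 10 + 20 - 2*5 = 20
d_short = hamming_binary(test_example, short_example, vocabulary)  # 10 + 3 - 0 = 13
```

The short stored example shares no word at all with the test example, yet it is closer (13 versus 20) than the long example that shares five words with it, simply because it contains fewer 1's.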

This situation could explain the poor results obtained by the k-NN
algorithm in Mooney's work, in which a large number of attributes
was used. Further, it could explain why the results of Ng's system
working with a rich attribute set (including binary-valued
contextual features) were lower than those obtained with a simpler
set of attributes^7.

In order to address this limitation we propose to reduce the
attribute space by collapsing all binary attributes
$c_1, \ldots, c_m$ into a single set-valued attribute $c$ that
contains, for each example, all content words that appear in the
sentence. In this setting, the similarity $S$ between two values
$V_i = \{w_{i_1}, w_{i_2}, \ldots, w_{i_n}\}$ and
$V_j = \{w_{j_1}, w_{j_2}, \ldots, w_{j_m}\}$ can be redefined as
$S(V_i, V_j) = \| V_i \cap V_j \|$, that is, equal to the number of
words shared^8.

This approach implies that a test example is classified taking into
account the information about the words it contains (positive
information), but not the information about the words it does not
contain. Besides, it allows a very efficient implementation, which
will be referred to as PEB (standing for Positive Exemplar-Based).
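
A minimal sketch of the positive exemplar-based idea is given below. It is an illustration under stated assumptions, not the authors' implementation: each example is represented as a pair (tuple of local attribute values, set of content words), the overall similarity is taken to be the number of matching local attributes plus the number of shared content words, and each neighbour votes with a weight equal to its similarity. All names are hypothetical.

```python
from collections import defaultdict

def peb_similarity(ex_a, ex_b):
    """Similarity between two examples of the form (local_values, content_words)."""
    local_a, content_a = ex_a
    local_b, content_b = ex_b
    # Local attributes: number of positions holding identical values.
    local_matches = sum(x == y for x, y in zip(local_a, local_b))
    # Set-valued attribute c: matching coefficient, i.e. number of shared words.
    return local_matches + len(content_a & content_b)

def peb_classify(train, query, k=7):
    """Example-weighted vote among the k most similar stored examples."""
    nearest = sorted(train, key=lambda t: peb_similarity(t[0], query), reverse=True)[:k]
    votes = defaultdict(float)
    for example, sense in nearest:
        votes[sense] += peb_similarity(example, query)
    return max(votes, key=votes.get)
```

Only the words actually present in the two sentences are touched when computing the intersection, so the cost per comparison depends on sentence length rather than on the size of the vocabulary.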

^7 Recall that the authors attributed the bad results to the absence
of attribute weighting and to the attribute pruning, respectively.

^8 This measure is usually known as the matching coefficient [10].
More complex similarity measures, e.g. Jaccard or Dice coefficients,
have not been explored.

In the same direction, we have tested the Naive Bayes algorithm
combining only the conditional probabilities corresponding to the
words that appear in the test examples. This variant is referred to
as PNB. The results of both PEB and PNB are included in Table 3,
from which the following conclusions can be drawn.

• The PEB approach reaches excellent results, improving by 10.6
  points the accuracy of EB (see 5th and 7th columns of Table 3).
  Further, the results obtained significantly outperform those
  obtained using SetA, indicating that the (careful) addition of
  richer attributes leads to more accurate classifiers.
  Additionally, the behaviour of the different variants is similar
  to that observed when using SetA, with the exception that the
  addition of attribute weighting to the example weighting
  (PEBh,10,e,a) seems no longer useful.

• The PNB algorithm is at least as accurate as NB.

• Table 4 shows that the positive approach greatly increases the
  efficiency of the algorithms. The acceleration factor is 80 for NB
  and 15 for EBh (the calculation of EBcs variants was simply not
  feasible working with the attributes of SetB).

• The comparative conclusions between the Bayesian and
  Exemplar-based approaches reached in the experiments using SetA
  also hold here. Further, the accuracy of PEBh,7,e is now
  significantly higher than that of PNB.
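
The difference between NB and PNB on the binary content-word attributes can be sketched as follows. This is a simplified illustration, not the authors' program: it models only the content-word attributes, works with sums of logarithms (equivalent to the product of probabilities, but numerically safer), and applies the smoothing described in Section 2.1 (zero counts replaced by $P(C_i)/N$). The function names and the `positive_only` flag are hypothetical.

```python
import math
from collections import Counter

def train_nb(examples):
    """examples: list of (set_of_content_words, sense) pairs."""
    n = len(examples)
    class_count = Counter(sense for _, sense in examples)
    word_count = Counter((w, sense) for words, sense in examples for w in words)
    vocab = {w for words, _ in examples for w in words}
    return n, class_count, word_count, vocab

def _log_prob(count, class_total, prior, n):
    # Relative frequency; a zero count is replaced by P(C_i)/N.
    return math.log(count / class_total if count > 0 else prior / n)

def nb_classify(model, words, positive_only=True):
    n, class_count, word_count, vocab = model
    best, best_score = None, -math.inf
    for sense, total in class_count.items():
        prior = total / n
        score = math.log(prior)
        # Positive information: words that appear in the test example.
        for w in words:
            score += _log_prob(word_count[(w, sense)], total, prior, n)
        # Plain NB also multiplies in P(word absent | sense) for every other word.
        if not positive_only:
            for w in vocab - words:
                score += _log_prob(total - word_count[(w, sense)], total, prior, n)
        if score > best_score:
            best, best_score = sense, score
    return best
```

With `positive_only=True` the loop over the whole vocabulary disappears, so the work per test example is proportional to the number of words in the sentence; this is the kind of saving that the acceleration factors above reflect.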

Table 4. CPU-time elapsed on the set of 15 words ("hh:mm").

SetA     NB  EBh,15,e  EBh,7,a  EBcs,10,e
      00:07     00:08    00:11      09:56

SetB     NB    PNB  EBh,15,e  PEBh,7,e  PEBh,7,a  PEBcs,10,e
      16:13  00:12     06:04     00:25     03:55       49:43



5 GLOBAL RESULTS 

In order to ensure that the results obtained so far also hold on a
realistic broad-coverage domain, the PNB and PEB algorithms have
been tested on the whole sense-tagged corpus, using both sets of
attributes. This corpus contains about 192,800 examples of 121 nouns
and 70 verbs. The average number of senses is 7.2 for nouns, 12.6
for verbs, and 9.2 overall. The average number of training examples
is 933.9 for nouns, 938.7 for verbs, and 935.6 overall.

The results obtained are presented in Table 5. It has to be noted
that the results of PEBcs using SetB were not calculated due to the
extremely large computational effort required by the algorithm (see
Table 4).

Table 5. Global results on the whole corpus: accuracy (%) and CPU time (hh:mm).

                   Accuracy (%)           CPU time (hh:mm)
      POS     MFS   PNB  PEBh  PEBcs     PNB   PEBh  PEBcs
SetA  nouns  56.4  68.7  68.5   70.2
      verbs  48.7  64.8  65.3   66.4
      all    53.2  67.1  67.2   68.6   00:33  00:47  92:22
SetB  nouns  56.4  69.2  70.1      -
      verbs  48.7  63.4  67.0      -
      all    53.2  66.8  68.8      -   01:06  01:46      -

Results are consistent with those reported previously, that is:



• In SetA, the Exemplar-based approach using the MVDM metric 
is significantly superior to the rest. 

• In SetB, the Exemplar-based approach using Hamming distance 
and example weighting significantly outperforms the Bayesian ap- 
proach. Although the use of the MVDM metric could lead to better 
results, the current implementation is computationally prohibitive. 

• Contrary to the Exemplar-based approach, Naive Bayes does not
  improve accuracy when moving from SetA to SetB, that is, the
  simple addition of attributes does not guarantee accuracy
  improvements in the Bayesian framework.

6 CONCLUSIONS 

This work has focused on clarifying some contradictory results 
obtained when comparing Naive Bayes and Exemplar-based ap- 
proaches to WSD. Different alternative algorithms have been tested 
using two different attribute sets on a large sense-tagged corpus. The 
experiments carried out show that Exemplar-based algorithms have 
generally better performance than Naive Bayes, when they are ex- 
tended with example/attribute weighting, richer metrics, etc. 

The reported experiments also show that the Exemplar-based ap- 
proach is very sensitive to the representation of a concrete type of 
attributes, frequently used in Natural Language problems. To avoid 
this drawback, an alternative representation of the attributes has been 
proposed and successfully tested. Furthermore, this representation 
also improves the efficiency of the algorithms, when using a large set 
of attributes. 

The test on the whole corpus allows us to estimate that, in a re- 
alistic scenario, the best tradeoff between performance and compu- 
tational requirements is achieved by using the Positive Exemplar- 
based algorithm, SetB set of attributes, Hamming distance, and 
example-weighting. 

Further research on the presented algorithms to be carried out in
the near future includes: 1) The study of the behaviour with respect
to the number of training examples; 2) The study of the robustness
in the presence of highly redundant attributes; 3) The testing of
the algorithms on alternative sense-tagged corpora automatically
acquired from the Internet.